Vinho Verde wine quality
# CHOOSE FILE TO ANALYSE: Please use 'Red_Wine.csv'
redwine <- read.csv("Red_Wine.csv", header = TRUE, sep = ",")
Our winery in the northwest of Portugal produces Vinho Verde wine. In recent years the quality of our red wine has declined, and each year we lose places at the internationally acclaimed “Best of Vinho Verde awards”.
We are wondering whether we could improve the quality and rating of the wine by adjusting the production process and thereby influencing some physicochemical characteristics of the wine. Our rationale is that our grapes are the same as our competitors', so there is certainly room for improvement. We have a limited budget for improving our wine and would like to understand which components matter most and how to invest in producing better wine.
We gathered data from 1599 red Vinho Verde wines with their measurable characteristics:
The basic structure of the data is as follows:
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
A summary of the data can be found below:
| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 4.60 | Min. :0.1200 | Min. :0.000 | Min. : 0.900 | Min. :0.01200 | Min. : 1.00 | Min. : 6.00 | Min. :0.9901 | Min. :2.740 | Min. :0.3300 | Min. : 8.40 | Min. :3.000 | |
| 1st Qu.: 7.10 | 1st Qu.:0.3900 | 1st Qu.:0.090 | 1st Qu.: 1.900 | 1st Qu.:0.07000 | 1st Qu.: 7.00 | 1st Qu.: 22.00 | 1st Qu.:0.9956 | 1st Qu.:3.210 | 1st Qu.:0.5500 | 1st Qu.: 9.50 | 1st Qu.:5.000 | |
| Median : 7.90 | Median :0.5200 | Median :0.260 | Median : 2.200 | Median :0.07900 | Median :14.00 | Median : 38.00 | Median :0.9968 | Median :3.310 | Median :0.6200 | Median :10.20 | Median :6.000 | |
| Mean : 8.32 | Mean :0.5278 | Mean :0.271 | Mean : 2.539 | Mean :0.08747 | Mean :15.87 | Mean : 46.47 | Mean :0.9967 | Mean :3.311 | Mean :0.6581 | Mean :10.42 | Mean :5.636 | |
| 3rd Qu.: 9.20 | 3rd Qu.:0.6400 | 3rd Qu.:0.420 | 3rd Qu.: 2.600 | 3rd Qu.:0.09000 | 3rd Qu.:21.00 | 3rd Qu.: 62.00 | 3rd Qu.:0.9978 | 3rd Qu.:3.400 | 3rd Qu.:0.7300 | 3rd Qu.:11.10 | 3rd Qu.:6.000 | |
| Max. :15.90 | Max. :1.5800 | Max. :1.000 | Max. :15.500 | Max. :0.61100 | Max. :72.00 | Max. :289.00 | Max. :1.0037 | Max. :4.010 | Max. :2.0000 | Max. :14.90 | Max. :8.000 |
The correlations in the data are shown in the graph below.
We are thus exploring whether there is a link between purely physicochemical characteristics and the perceived quality of the wines, so that we can tailor our production process to consumer preferences and achieve consistent quality. To conduct this analysis, we plan to execute the following steps:
Once we have the best model, we will be able to know the characteristics that make a great wine. We will be able to adapt our production process in order to produce a wine that will win awards, use our investment resources in the most effective way and that will appeal to the customers.
Understanding our wine data is one of the major activities of the analysis. It involves detecting and removing errors and inconsistencies from the data in order to improve its quality, and it will also play a major role in the decision-making process.
A good approach should satisfy several requirements. First of all, we have a dictionary in which all the variables are explained. Then we detect and remove all major errors and inconsistencies from the data. This approach is supported by R tools that limit manual inspection and programming effort.
The first step is to ensure that the data read from the CSV file is read in the right format.
# coerce every physicochemical column to numeric and quality to integer
num_cols <- setdiff(names(redwine), "quality")
redwine[num_cols] <- lapply(redwine[num_cols], as.numeric)
redwine$quality <- as.integer(redwine$quality)
The data is now in the right format.
The wine data looks like this:
| Wine 1 | Wine 2 | Wine 3 | Wine 4 | Wine 5 | Wine 6 | Wine 7 | Wine 8 | Wine 9 | Wine 10 | |
|---|---|---|---|---|---|---|---|---|---|---|
| fixed.acidity | 7.40 | 7.80 | 7.80 | 11.20 | 7.40 | 7.40 | 7.90 | 7.30 | 7.80 | 7.50 |
| volatile.acidity | 0.70 | 0.88 | 0.76 | 0.28 | 0.70 | 0.66 | 0.60 | 0.65 | 0.58 | 0.50 |
| citric.acid | 0.00 | 0.00 | 0.04 | 0.56 | 0.00 | 0.00 | 0.06 | 0.00 | 0.02 | 0.36 |
| residual.sugar | 1.90 | 2.60 | 2.30 | 1.90 | 1.90 | 1.80 | 1.60 | 1.20 | 2.00 | 6.10 |
| chlorides | 0.08 | 0.10 | 0.09 | 0.08 | 0.08 | 0.08 | 0.07 | 0.06 | 0.07 | 0.07 |
| free.sulfur.dioxide | 11.00 | 25.00 | 15.00 | 17.00 | 11.00 | 13.00 | 15.00 | 15.00 | 9.00 | 17.00 |
| total.sulfur.dioxide | 34.00 | 67.00 | 54.00 | 60.00 | 34.00 | 40.00 | 59.00 | 21.00 | 18.00 | 102.00 |
| density | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 |
| pH | 3.51 | 3.20 | 3.26 | 3.16 | 3.51 | 3.51 | 3.30 | 3.39 | 3.36 | 3.35 |
| sulphates | 0.56 | 0.68 | 0.65 | 0.58 | 0.56 | 0.56 | 0.46 | 0.47 | 0.57 | 0.80 |
| alcohol | 9.40 | 9.80 | 9.80 | 9.80 | 9.40 | 9.40 | 9.40 | 10.00 | 9.50 | 10.50 |
| quality | 5.00 | 5.00 | 5.00 | 6.00 | 5.00 | 5.00 | 5.00 | 7.00 | 7.00 | 5.00 |
The second step is to ensure that the data read from the CSV file does not contain any missing values. As shown in the summary data above, it does not contain any missing values. Thus, we can proceed with the study.
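This can be verified programmatically; a minimal sketch of the check (any equivalent NA inspection works):

```r
# count missing values per column -- all entries should be zero
colSums(is.na(redwine))
# abort the analysis if any NA slipped through
stopifnot(!anyNA(redwine))
```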
We plot the histograms of the data provided:
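The histograms can be produced with base R; a minimal sketch using a 3x4 grid:

```r
# one histogram per variable, arranged on a 3x4 grid
par(mfrow = c(3, 4))
for (v in names(redwine)) {
  hist(redwine[[v]], main = v, xlab = v, col = "grey")
}
par(mfrow = c(1, 1))  # reset the plotting grid
```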
Now we will look in more detail at the correlations between the variables, which will be central to the project. The correlation matrix is:
| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed.acidity | 1.00 | -0.26 | 0.67 | 0.11 | 0.09 | -0.15 | -0.11 | 0.67 | -0.68 | 0.18 | -0.06 | 0.12 |
| volatile.acidity | -0.26 | 1.00 | -0.55 | 0.00 | 0.06 | -0.01 | 0.08 | 0.02 | 0.23 | -0.26 | -0.20 | -0.39 |
| citric.acid | 0.67 | -0.55 | 1.00 | 0.14 | 0.20 | -0.06 | 0.04 | 0.36 | -0.54 | 0.31 | 0.11 | 0.23 |
| residual.sugar | 0.11 | 0.00 | 0.14 | 1.00 | 0.06 | 0.19 | 0.20 | 0.36 | -0.09 | 0.01 | 0.04 | 0.01 |
| chlorides | 0.09 | 0.06 | 0.20 | 0.06 | 1.00 | 0.01 | 0.05 | 0.20 | -0.27 | 0.37 | -0.22 | -0.13 |
| free.sulfur.dioxide | -0.15 | -0.01 | -0.06 | 0.19 | 0.01 | 1.00 | 0.67 | -0.02 | 0.07 | 0.05 | -0.07 | -0.05 |
| total.sulfur.dioxide | -0.11 | 0.08 | 0.04 | 0.20 | 0.05 | 0.67 | 1.00 | 0.07 | -0.07 | 0.04 | -0.21 | -0.19 |
| density | 0.67 | 0.02 | 0.36 | 0.36 | 0.20 | -0.02 | 0.07 | 1.00 | -0.34 | 0.15 | -0.50 | -0.17 |
| pH | -0.68 | 0.23 | -0.54 | -0.09 | -0.27 | 0.07 | -0.07 | -0.34 | 1.00 | -0.20 | 0.21 | -0.06 |
| sulphates | 0.18 | -0.26 | 0.31 | 0.01 | 0.37 | 0.05 | 0.04 | 0.15 | -0.20 | 1.00 | 0.09 | 0.25 |
| alcohol | -0.06 | -0.20 | 0.11 | 0.04 | -0.22 | -0.07 | -0.21 | -0.50 | 0.21 | 0.09 | 1.00 | 0.48 |
| quality | 0.12 | -0.39 | 0.23 | 0.01 | -0.13 | -0.05 | -0.19 | -0.17 | -0.06 | 0.25 | 0.48 | 1.00 |
And the graphical representation is:
From the matrix and plot above, we can derive that:
And now we plot the rest of the variables against the quality:
In this section, we plot other graphs to see how the data is distributed, based on the correlation matrix.
The main objective of our exercise is to help identify poor quality wine based on its chemical attributes. Poor quality, or faulty wines, have been defined in our dataset based on their quality:
Wines with a quality below 5 are considered bad wines - we do not want to sell these wines to the public. This category is of primary interest in our study.
In addition, there is a clear distinction between the number of wines classified with quality 5 or 6 and the number classified above 6. Thus a reasonable classification is as follows: average wines are those with quality 5 or 6, and good wines are those with quality above 6.
| Classification | Quality | # Occurrences in estimation data |
|---|---|---|
| Faulty | 4 or less | 63 (4%) |
| Average | 5 & 6 | 1319 (82%) |
| Good | 7 or greater | 217 (14%) |
| Total | - | 1599 |
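The three-band classification above can be derived from the numeric quality score; a sketch, where the column name `taste` is our choice:

```r
# faulty: quality <= 4, average: 5 or 6, good: >= 7
redwine$taste <- cut(redwine$quality,
                     breaks = c(0, 4, 6, 10),
                     labels = c("faulty", "average", "good"))
table(redwine$taste)  # 63 faulty, 1319 average, 217 good
```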
Summary statistics for the faulty wines (quality 4 or less):

| min | 25 percent | median | mean | 75 percent | max | std | |
|---|---|---|---|---|---|---|---|
| fixed.acidity | 4.60 | 6.80 | 7.50 | 7.87 | 8.40 | 12.50 | 1.65 |
| volatile.acidity | 0.23 | 0.56 | 0.68 | 0.72 | 0.88 | 1.58 | 0.25 |
| citric.acid | 0.00 | 0.02 | 0.08 | 0.17 | 0.27 | 1.00 | 0.21 |
| residual.sugar | 1.20 | 1.90 | 2.10 | 2.68 | 2.95 | 12.90 | 1.72 |
| chlorides | 0.04 | 0.07 | 0.08 | 0.10 | 0.09 | 0.61 | 0.08 |
| free.sulfur.dioxide | 3.00 | 5.00 | 9.00 | 12.06 | 15.50 | 41.00 | 9.08 |
| total.sulfur.dioxide | 7.00 | 13.50 | 26.00 | 34.44 | 48.00 | 119.00 | 26.40 |
| density | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 |
| pH | 2.74 | 3.30 | 3.38 | 3.38 | 3.50 | 3.90 | 0.18 |
| sulphates | 0.33 | 0.50 | 0.56 | 0.59 | 0.60 | 2.00 | 0.22 |
| alcohol | 8.40 | 9.60 | 10.00 | 10.22 | 11.00 | 13.10 | 0.92 |
| quality | 3.00 | 4.00 | 4.00 | 3.84 | 4.00 | 4.00 | 0.37 |
Summary statistics for the average wines (quality 5 or 6):

| min | 25 percent | median | mean | 75 percent | max | std | |
|---|---|---|---|---|---|---|---|
| fixed.acidity | 4.70 | 7.10 | 7.80 | 8.25 | 9.10 | 15.90 | 1.68 |
| volatile.acidity | 0.16 | 0.41 | 0.54 | 0.54 | 0.64 | 1.33 | 0.17 |
| citric.acid | 0.00 | 0.09 | 0.24 | 0.26 | 0.40 | 0.79 | 0.19 |
| residual.sugar | 0.90 | 1.90 | 2.20 | 2.50 | 2.60 | 15.50 | 1.40 |
| chlorides | 0.03 | 0.07 | 0.08 | 0.09 | 0.09 | 0.61 | 0.05 |
| free.sulfur.dioxide | 1.00 | 8.00 | 14.00 | 16.37 | 22.00 | 72.00 | 10.49 |
| total.sulfur.dioxide | 6.00 | 24.00 | 40.00 | 48.95 | 65.00 | 165.00 | 32.71 |
| density | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 |
| pH | 2.86 | 3.21 | 3.31 | 3.31 | 3.40 | 4.01 | 0.15 |
| sulphates | 0.37 | 0.54 | 0.61 | 0.65 | 0.70 | 1.98 | 0.17 |
| alcohol | 8.40 | 9.50 | 10.00 | 10.25 | 10.90 | 14.90 | 0.97 |
| quality | 5.00 | 5.00 | 5.00 | 5.48 | 6.00 | 6.00 | 0.50 |
Summary statistics for the good wines (quality 7 or greater):

| min | 25 percent | median | mean | 75 percent | max | std | |
|---|---|---|---|---|---|---|---|
| fixed.acidity | 4.90 | 7.40 | 8.70 | 8.85 | 10.10 | 15.60 | 2.00 |
| volatile.acidity | 0.12 | 0.30 | 0.37 | 0.41 | 0.49 | 0.92 | 0.14 |
| citric.acid | 0.00 | 0.30 | 0.40 | 0.38 | 0.49 | 0.76 | 0.19 |
| residual.sugar | 1.20 | 2.00 | 2.30 | 2.71 | 2.70 | 8.90 | 1.36 |
| chlorides | 0.01 | 0.06 | 0.07 | 0.08 | 0.08 | 0.36 | 0.03 |
| free.sulfur.dioxide | 3.00 | 6.00 | 11.00 | 13.98 | 18.00 | 54.00 | 10.23 |
| total.sulfur.dioxide | 7.00 | 17.00 | 27.00 | 34.89 | 43.00 | 289.00 | 32.57 |
| density | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 |
| pH | 2.88 | 3.20 | 3.27 | 3.29 | 3.38 | 3.78 | 0.15 |
| sulphates | 0.39 | 0.65 | 0.74 | 0.74 | 0.82 | 1.36 | 0.13 |
| alcohol | 9.20 | 10.80 | 11.60 | 11.52 | 12.20 | 14.00 | 1.00 |
| quality | 7.00 | 7.00 | 7.00 | 7.08 | 7.00 | 8.00 | 0.28 |
library(caret) # createDataPartition()
set.seed(1985) # fix the random number generation seed so the split is reproducible
redwine_split <- createDataPartition(y = redwine$quality,
    p = 1298/1599, list = FALSE) # ~80% of the observations for training, ~20% held out for testing
training_redwine <- redwine[redwine_split, ]
testing_redwine  <- redwine[-redwine_split, ]
The analysis process carried out is based on the 6-step process provided in class. We decided that we would use an implementation of Breiman and Cutler’s Random Forests for Classification and Regression. As a result we did not have to restrict ourselves to a binary dependent variable.
The first test we are going to run is a multinomial logistic regression. The idea is simple: we try to predict the taste category derived from the quality score using the other 11 independent variables. A summary of the regression is shown below.
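The fit can be done with `multinom()` from the nnet package; a sketch, assuming a `taste` factor column derived from quality as described in the classification section, with "average" as the baseline level:

```r
library(nnet)  # multinom()
# make "average" the reference category so the coefficient rows
# compare "faulty" and "good" against it
training_redwine$taste <- relevel(training_redwine$taste, ref = "average")
taste_model <- multinom(taste ~ fixed.acidity + volatile.acidity + citric.acid +
                          residual.sugar + chlorides + free.sulfur.dioxide +
                          total.sulfur.dioxide + density + pH + sulphates + alcohol,
                        data = training_redwine)
summary(taste_model)
```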
## Call:
## multinom(formula = taste ~ fixed.acidity + volatile.acidity +
## citric.acid + residual.sugar + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + density + pH + sulphates + alcohol,
## data = training_redwine)
##
## Coefficients:
## (Intercept) fixed.acidity volatile.acidity citric.acid
## faulty 405.6065 0.6609973 4.697646 0.9022834
## good 165.1007 0.2172616 -2.580776 0.5805793
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## faulty 0.3739949 7.039428 -0.025045965 -0.01196597
## good 0.2213255 -8.453456 0.008314928 -0.01831228
## density pH sulphates alcohol
## faulty -435.8021 7.1278877 -1.142488 -0.6364384
## good -179.8604 0.4315777 3.755149 0.7501815
##
## Std. Errors:
## (Intercept) fixed.acidity volatile.acidity citric.acid
## faulty 3.043451 0.16482994 0.9103453 1.3053491
## good 1.871254 0.08843828 0.8479439 0.9097131
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## faulty 0.08221751 3.288558 0.02391135 0.008781613
## good 0.06611148 3.531090 0.01358676 0.005692609
## density pH sulphates alcohol
## faulty 2.967102 1.5336428 1.3665900 0.18444246
## good 1.825386 0.9346179 0.5627972 0.09671016
##
## Residual Deviance: 1073.626
## AIC: 1121.626
| (Intercept) | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| faulty | 405.6065 | 0.6609973 | 4.697646 | 0.9022834 | 0.3739949 | 7.039428 | -0.0250460 | -0.0119660 | -435.8021 | 7.1278877 | -1.142488 | -0.6364384 |
| good | 165.1007 | 0.2172616 | -2.580776 | 0.5805793 | 0.2213255 | -8.453456 | 0.0083149 | -0.0183123 | -179.8604 | 0.4315777 | 3.755149 | 0.7501815 |
As multinomial logistic regression does not directly provide p-values, we calculate them from Wald z-statistics (each coefficient divided by its standard error, compared against a standard normal distribution). The calculated p-values are as follows:
| (Intercept) | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| faulty | 0 | 0.0000607 | 0.0000002 | 0.4894273 | 0.0000054 | 0.0323078 | 0.2948916 | 0.1730033 | 0 | 0.0000034 | 0.4031472 | 0.0005593 |
| good | 0 | 0.0140240 | 0.0023379 | 0.5233433 | 0.0008147 | 0.0166654 | 0.5405461 | 0.0012961 | 0 | 0.6442469 | 0.0000000 | 0.0000000 |
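The p-values above can be computed from the coefficient and standard-error blocks; a sketch, assuming the fitted model object is called `taste_model`:

```r
s <- summary(taste_model)
z <- s$coefficients / s$standard.errors  # Wald z-statistics
p <- 2 * (1 - pnorm(abs(z)))             # two-tailed p-values
round(p, 7)
```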
The model summary output has a block of coefficients and a block of standard errors. Each block has one row of values corresponding to a model equation. Focusing on the block of coefficients, the first row compares taste = “faulty” to our baseline taste = “average”, and the second row compares taste = “good” to the same baseline. If we call the coefficients from the first row b1 and those from the second row b2, we can write our model equations as follows:
\[\ln\left(\dfrac{P(taste=faulty)}{P(taste=average)}\right)=b_{10}+b_{11}x_1+b_{12}x_2+\dots+b_{1,11}x_{11}\] \[\ln\left(\dfrac{P(taste=good)}{P(taste=average)}\right)=b_{20}+b_{21}x_1+b_{22}x_2+\dots+b_{2,11}x_{11}\] where \(x_1,\dots,x_{11}\) are the eleven physicochemical predictors. We can also use predicted probabilities to help us understand the model, calculated for each outcome level with the fitted function. We start by generating the predicted probabilities for the observations in our dataset and viewing the first few rows:
| average | faulty | good | Taste from original data (1 is average, 2 is faulty and 3 is good) |
|---|---|---|---|
| 0.8830922 | 0.1084340 | 0.0084737 | 1 |
| 0.9620570 | 0.0295885 | 0.0083545 | 1 |
| 0.9557823 | 0.0324463 | 0.0117714 | 1 |
| 0.9227038 | 0.0115456 | 0.0657506 | 1 |
| 0.8830922 | 0.1084340 | 0.0084737 | 1 |
| 0.9126397 | 0.0786353 | 0.0087250 | 1 |
| 0.9640966 | 0.0286594 | 0.0072439 | 1 |
| 0.8979686 | 0.0796937 | 0.0223377 | 3 |
| 0.9224107 | 0.0544786 | 0.0231107 | 3 |
| 0.9089993 | 0.0137815 | 0.0772192 | 1 |
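The probability table above can be generated directly from the fitted model; a sketch using the assumed model name `taste_model`:

```r
# first ten rows of per-class predicted probabilities
head(fitted(taste_model), 10)
```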
We can predict the test values based on this regression:
| average | faulty | good | |
|---|---|---|---|
| average | 248 | 7 | 30 |
| faulty | 1 | 0 | 0 |
| good | 4 | 0 | 10 |
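The confusion table above can be reproduced with `predict()` on the held-out set; a sketch, assuming the model, the train/test split, and the `taste` factor defined earlier:

```r
pred_taste <- predict(taste_model, newdata = testing_redwine)
# rows: actual class, columns: predicted class
table(actual = testing_redwine$taste, predicted = pred_taste)
# overall misclassification error on the testing sample
mean(pred_taste != testing_redwine$taste)
```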
Based on the outputs, we have a misclassification error of 14%. We can also plot the ROC curve, which illustrates the performance of a classifier as its discrimination threshold varies: it shows the true-positive rate (also known as recall) against the false-positive rate (also known as fall-out, or the probability of a false alarm) at various threshold settings.
The area under the curve is 88.0051764%.
Note: as a rule of thumb for AUC values: above 90% excellent, 80-90% very good, 70-80% good, 60-70% fair, below 60% of little value.
In this case, we use stepwise regression to find the best set of predictors. The summary of the fit is as follows:
| (Intercept) | fixed.acidity | volatile.acidity | residual.sugar | chlorides | total.sulfur.dioxide | density | pH | sulphates | alcohol | |
|---|---|---|---|---|---|---|---|---|---|---|
| faulty | 188.7181 | 0.5036951 | 4.426422 | 0.2642166 | 7.604152 | -0.0171398 | -214.0411 | 5.8123193 | -1.536922 | -0.4210022 |
| good | 240.5433 | 0.3228469 | -2.774771 | 0.2518946 | -7.486849 | -0.0156194 | -256.9891 | 0.7773722 | 3.821127 | 0.7003148 |
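The stepwise selection can be run with `stepAIC()` from MASS; a sketch, assuming the full multinomial model `taste_model` from the previous section (note that citric.acid and free.sulfur.dioxide drop out, matching the coefficient table above):

```r
library(MASS)  # stepAIC()
# backward elimination, removing predictors while the AIC improves
step_model <- stepAIC(taste_model, direction = "backward", trace = FALSE)
summary(step_model)
```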
Again, we calculate the new p-values from the coefficients and standard errors of the stepwise model:
| (Intercept) | fixed.acidity | volatile.acidity | residual.sugar | chlorides | total.sulfur.dioxide | density | pH | sulphates | alcohol | |
|---|---|---|---|---|---|---|---|---|---|---|
| faulty | 0 | 3.29e-04 | 2.00e-07 | 0.0013179 | 0.0192639 | 0.0047548 | 0 | 0.0001573 | 0.2662019 | 0.019449 |
| good | 0 | 1.14e-05 | 7.59e-05 | 0.0000831 | 0.0264939 | 0.0001211 | 0 | 0.3972321 | 0.0000000 | 0.000000 |
We can predict the test values based on the stepwise multinomial regression:
| average | faulty | good | |
|---|---|---|---|
| average | 246 | 7 | 30 |
| faulty | 1 | 0 | 0 |
| good | 6 | 0 | 10 |
Based on the outputs, we have a misclassification error of 14.67%. Again, we can plot the ROC curve, showing the true-positive rate against the false-positive rate at various threshold settings.
The area under the curve is 87.9951056%.
With the simple multinomial regression taking all the variables into account, we are able to predict the quality of wine with a misclassification error of 14%. Based on the p-values, the parameters that characterize a good wine are as follows:
Also, based on the results, the key differences between a good wine and a bad wine are found in the total SO2, sulphates and alcohol level. We will need to pay attention to those values.
It is now time to run a classification algorithm on the data set. We have chosen the random forest algorithm for this.
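The forest can be grown with the randomForest package; a minimal sketch, assuming the `taste` factor and the training split defined earlier:

```r
library(randomForest)
set.seed(1985)  # reproducible forest
rf_model <- randomForest(taste ~ . - quality, data = training_redwine,
                         ntree = 128, importance = TRUE)
rf_model$confusion  # out-of-bag confusion matrix with per-class errors
```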
First of all, we have a look at the confusion matrix results:
| Predicted average | Predicted faulty | Predicted good | Class error | |
|---|---|---|---|---|
| Actual average | 506 | 257 | 303 | 0.53 |
| Actual faulty | 15 | 35 | 6 | 0.38 |
| Actual good | 7 | 8 | 162 | 0.08 |
And here is the same confusion matrix from a percentage-per-class perspective (rows sum to 100%):
| Predicted average | Predicted faulty | Predicted good | |
|---|---|---|---|
| Actual average | 47.47% | 24.11% | 28.42% |
| Actual faulty | 26.79% | 62.5% | 10.71% |
| Actual good | 3.95% | 4.52% | 91.53% |
This is how the error evolves with the number of trees:
After several trials, the error tends to stabilize after about 80 trees; we have selected 128 trees.
For the final version of the model, we tried several combinations of parameters until we found one we were satisfied with:
| Parameter | Value (average, faulty, good) |
|---|---|
| classwt | 10^{-5}, 1, 1 |
| sampsize | 56, 56, 56 |
| cutoff | 0.4, 0.3, 0.3 |
| mtry | 9 |
| ntree | 256 |
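The settings in the table translate into the following call; a sketch, assuming the factor levels are ordered average, faulty, good as in the table (the per-class vectors are positional, so they must match `levels(training_redwine$taste)`):

```r
library(randomForest)
set.seed(1985)
rf_final <- randomForest(taste ~ . - quality, data = training_redwine,
                         classwt  = c(1e-5, 1, 1),     # priors: average, faulty, good
                         sampsize = c(56, 56, 56),     # stratified draws per class
                         cutoff   = c(0.4, 0.3, 0.3),  # voting thresholds (sum to 1)
                         mtry = 9, ntree = 256)
```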
After running the prediction on the testing sample, the following confusion matrix is obtained:
## Confusion Matrix and Statistics
##
## Reference
## Prediction average faulty good
## average 109 2 2
## faulty 64 3 0
## good 80 2 38
##
## Overall Statistics
##
## Accuracy : 0.5
## 95% CI : (0.442, 0.558)
## No Information Rate : 0.8433
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1985
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: average Class: faulty Class: good
## Sensitivity 0.4308 0.42857 0.9500
## Specificity 0.9149 0.78157 0.6846
## Pos Pred Value 0.9646 0.04478 0.3167
## Neg Pred Value 0.2299 0.98283 0.9889
## Prevalence 0.8433 0.02333 0.1333
## Detection Rate 0.3633 0.01000 0.1267
## Detection Prevalence 0.3767 0.22333 0.4000
## Balanced Accuracy 0.6729 0.60507 0.8173
The ROC curve is as follows.
The area under the curve is 94.270975%.
The model is strongly biased towards the faulty classification. It is, however, quite good at predicting the good wines, which is exactly what we are looking for. Although the model might seem terrible at dealing with the average wines (usually splitting them 47% average, 24% faulty, 28% good), we are not concerned with upward misclassifications. Moreover, with such an abundance of average wines, losing out on 47% of them is not so terrible. The main objective is to select the best wines to increase our revenues.
The variable importance is as follows:
| average | faulty | good | MeanDecreaseAccuracy | MeanDecreaseGini | |
|---|---|---|---|---|---|
| fixed.acidity | -1.1789124 | 0.4207999 | 4.259495 | -0.0313091 | 7.106187 |
| volatile.acidity | -5.5043443 | 18.3962683 | 15.377489 | 0.3575496 | 80.155634 |
| citric.acid | -2.5368459 | 1.5313713 | 7.474859 | 0.5444969 | 6.786713 |
| residual.sugar | 1.4540181 | 2.0167245 | 3.769658 | 2.9965290 | 11.190058 |
| chlorides | 1.5637435 | 2.1398075 | 5.241920 | 3.0042070 | 14.588990 |
| free.sulfur.dioxide | 0.8634829 | 9.3848467 | 5.325814 | 2.9541962 | 11.618034 |
| total.sulfur.dioxide | 4.7551936 | 8.4964902 | 11.070590 | 8.5796792 | 4.927110 |
| density | 2.1564976 | 2.7373943 | 5.800168 | 3.7947373 | 10.070354 |
| pH | -5.1593009 | 8.2437160 | 6.623883 | -2.8505398 | 5.564237 |
| sulphates | -8.4061864 | 18.3962171 | 29.648546 | 5.7976396 | 102.364189 |
| alcohol | 0.9503377 | 7.0952271 | 35.577178 | 19.1527096 | 57.833428 |
And here is the graphical representation:
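The importance table and plot come from the fitted forest; a sketch, assuming a model `rf_model` trained with `importance = TRUE`:

```r
library(randomForest)
importance(rf_model)  # permutation and Gini importance, per class and overall
varImpPlot(rf_model)  # dot-chart of the two overall importance measures
```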
Based on the importance, the parameters that characterize a good wine are as follows:
We have run two different methods to find the best predictors for our wine: multinomial logistic regression and Random Forest. The outcomes of the two studies are broadly similar.
On the one hand, the **multinomial logistic regression** model performs quite well, with a misclassification error of 14%, which means we are able to classify the wines reasonably accurately. The simulation results also show that these are the most important parameters to consider when making a good wine:
We should also not forget that the key differences between a good wine and a bad wine lie in the total SO2, sulphates and alcohol level; we will need to pay attention to those values.
On the other hand, the Random Forest method is strongly biased towards the faulty classification. However, it is quite good at predicting the good wines, which is exactly what we are looking for. The simulation results show that the parameters that characterize a good wine are as follows:
As can be seen, both studies give similar results in terms of parameters. However, Random Forest cannot tell us whether we need to increase or decrease these values to make a better wine.
And now comes the fun part: do these parameters make sense? They do!
Total acidity in wine is known as titratable acidity and is the sum of the fixed and volatile acids. Total acidity directly affects the color and flavor of wine and, depending on the style, is sought in perfect balance with the sweet and bitter sensations of other components. The regression says that for a better taste the wine's acidity should be composed of a higher proportion of fixed than volatile acids, which means a stronger sweet taste. This is aligned with what the regression states about residual sugar and chlorides: a wine has better quality with a higher proportion of sugar and a lower proportion of salt.
Sulphates are a preservative widely used in winemaking (and most food industries) for their antioxidant and antibacterial properties, and they play an important role in preventing oxidation and maintaining a wine's freshness. In high amounts, however, sulphur compounds such as SO2 can have an unpleasant smell and taste, which is why the regression says that a certain increase in sulphates is good but that an increase in SO2 reduces the quality of the wine.
In wine, alcohol and density are negatively correlated. Water has a density of 1 gram per cubic centimeter while alcohol has a density of about 0.79 g/cc, so the more alcohol a wine contains relative to other liquids, the lower its overall density. This too is aligned with the regression, which shows that better quality is associated with more alcohol and hence less density.
Good wine :)
You cannot buy happiness, … but you can buy a good wine and that is kind of the same thing.